Bayesian Biclustering of Gene Expression
Authors
Abstract
Background: Biclustering of gene expression data searches for local patterns of gene expression. A bicluster (or two-way cluster) is defined as a set of genes whose expression profiles are mutually similar within a subset of experimental conditions/samples. Although several biclustering algorithms have been studied, few are based on rigorous statistical models.

Results: We developed a Bayesian biclustering (BBC) model and implemented a Gibbs sampling procedure for its statistical inference. We showed that the BBC model can correctly identify multiple clusters in gene expression data. Using data simulated both from the model and with realistic characteristics, we demonstrated that the BBC algorithm outperforms other methods in both robustness and accuracy. We also showed that the model is stable under two normalization methods, interquartile range normalization and smallest quartile range normalization. Applying the BBC algorithm to yeast expression data, we observed that the majority of the biclusters we found are supported by significant biological evidence, such as enrichment of gene functions and of transcription factor binding sites in the corresponding promoter sequences.

Conclusions: The BBC algorithm is shown to be a robust model-based biclustering method that can discover biologically significant gene-condition clusters in microarray data. The BBC model can easily handle missing data via Monte Carlo imputation and has the potential to be extended to the integrated study of gene transcription networks.

From The 2007 International Conference on Bioinformatics & Computational Biology (BIOCOMP'07), Las Vegas, NV, USA. 25-28 June 2007. Jack Y Jang, Mary Qu Yang, Mengxia (Michelle) Zhu, Youping Deng and Hamid R Arabnia. Research. Published: 20 March 2008. BMC Genomics 2008, 9(Suppl 1):S4 doi:10.1186/1471-2164-9-S1-S4. This article is available from: http://www.biomedcentral.com/1471-2164/9/S1/S4. © 2008 Gu and Liu; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Clustering gene expression data has been an important problem in computational biology. While traditional clustering methods, such as hierarchical and K-means clustering, have proven useful in analyzing microarray data, they have some limitations. First, a gene or an experimental condition can be assigned to only one cluster. Second, all genes and conditions have to be assigned to clusters. However, biologically a gene or a sample could participate in multiple biological pathways, and a cellular process is generally active for only a subset of genes or experimental conditions. A biclustering scheme that produces gene and condition/sample clusters simultaneously can model the situation where a gene or a condition is involved in several biological functions. Furthermore, a biclustering model can avoid those "noise" genes that are not active in any experimental condition.

Biclustering of microarray data was first introduced by Cheng and Church [1]. They defined a residual score to search for submatrices as biclusters. This is a heuristic method and cannot model cases where two biclusters overlap with each other. Segal et al. [2] proposed a modified version of one-way clustering using a Bayesian model in which genes can belong to multiple clusters or to none of the clusters, but it cannot simultaneously cluster conditions/samples. Tseng and Wong developed a tight clustering algorithm [3].
It allows some genes to remain unclustered, but does not select conditions. Bergmann et al. [4] introduced the iterative signature algorithm (ISA), which searches for bicluster modules iteratively based on two pre-determined thresholds. ISA can identify multiple biclusters, but it is highly sensitive to the threshold values and tends to select a strong bicluster many times. The plaid model [5] is a statistical model that assumes the expression value in a bicluster is the sum of the main effect, the gene effect, the condition effect, and the noise term, i.e.:

$$y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}, \qquad (1)$$

where the noise $\epsilon_{ij} \sim N(0, \sigma^2)$. It further assumes that the expression values in the overlap of two biclusters are the sum of the two module effects. The plaid model uses a greedy search strategy, so errors can accumulate easily. Also, in the multiple-cluster case, the clusters identified by the algorithm tend to overlap to a great extent. Tanay et al. [6] proposed the SAMBA biclustering scheme, which uses bipartite graphs containing both conditions and genes. Ben-Dor et al. [7] attempted to identify order-preserving submatrices (OPSMs). Murali and Kasif [8] discretized gene expression data into several symbols and searched for conserved symbol motifs (xMOTIFs). A survey of different biclustering methods can be found in [9].

We here propose a Bayesian biclustering (BBC) model. For a single bicluster, we assume the same model as the plaid model [5], as described in equation (1). For multiple clusters, however, we constrain the overlapping of biclusters to only one direction (i.e., either the gene or the condition direction). In addition, we use a more flexible error model, which allows the error term of each cluster to have a different variance. To carry out Bayesian inference of biclusters, we implemented an efficient Gibbs sampling algorithm with all effect parameters (except the error variances) integrated out. We compared the performance of the BBC algorithm on several different types of simulated datasets with that of the plaid model [5], the ISA [4], the method of Cheng and Church [1], the SAMBA method [6] and the OPSMs [7]. Finally, we applied the BBC algorithm to the yeast expression dataset and identified many biologically significant biclusters.

Results and discussion

Simulation results

Bayesian biclustering in various simulated scenarios

We simulated a dataset with 400 genes and 50 samples. The background data are i.i.d. draws from N(0, 0.5). Two clusters of 100 genes and 15 conditions are simulated according to the BBC model, with main effects, gene effects, condition effects and error terms drawn as $\mu_1 \sim N(5, 0.5)$, $\mu_2 \sim N(7, 0.5)$, $\alpha_{i1}, \alpha_{i2} \sim N(0, 0.5)$, $\beta_{j1}, \beta_{j2} \sim N(0, 0.5)$ and $\epsilon_{ij1} \sim N(0, 0.5)$, $\epsilon_{ij2} \sim N(0, 0.7)$. We considered three scenarios for datasets with two clusters: the two clusters have some common conditions but distinct genes (Figure 1(a)); the two clusters have some common genes but distinct conditions (Figure 1(d)); and the two clusters have both common genes and common conditions (Figure 1(h)), in which case an additive model is assumed for the overlapping part. The results from using a non-overlapping-gene version of the BBC model are shown in Figures 1(b)-(c), (e)-(g), (i)-(k). In all cases the BBC model identified the genes and conditions of the simulated clusters correctly, but grouped them slightly differently because of our model constraints.

Comparison of biclustering algorithms on data simulated from statistical models

We compared six biclustering methods: the BBC method, the plaid model, ISA, SAMBA, OPSMs, and Cheng and Church's biclustering (CC). We considered both the single-cluster case and the multiple-cluster case, using data simulated from the plaid model. A single-cluster dataset is shown in Figure 2(a). The 400 × 50 background noise matrix is simulated as i.i.d. normal N(0, 0.5). We superimposed a cluster of size 100 × 20 according to the plaid model with $\mu_1 \sim N(5, 0.5)$ and $\alpha_{i1}, \beta_{j1} \sim N(0, 0.5)$. The multiple-cluster case is shown in Figure 2(b). The background is the same as above. Two clusters of size 100 × 15 are also simulated according to the plaid model with $\mu_1 \sim N(5, 0.5)$, $\mu_2 \sim N(7, 0.5)$ and $\alpha_{i1}, \beta_{j1}, \alpha_{i2}, \beta_{j2} \sim N(0, 0.5)$. An additive model is used for the overlapping part of the two clusters.

Since each method searches for biclusters with different structures, comparing biclustering results is not straightforward. To carry out a comprehensive comparison among the biclustering results for simulated datasets, we use the following four characteristics: sensitivity, specificity, overlapping rate, and number of clusters. Since we know which gene-condition combinations belong to the true biclusters, we use the standard definitions of sensitivity and specificity, both of which take values between 0 and 1. A higher sensitivity suggests that more "true" members of the clusters have been identified by the algorithm, while a higher specificity suggests that more background data points are excluded from the clusters. The overlapping rate is defined as

$$1 - \frac{\#\ \text{of matrix entries in the union of identified clusters}}{\sum_{\text{all clusters}} \#\ \text{of entries in each identified cluster}}.$$

Thus, if there is no overlap between the identified clusters, the overlapping rate is 0. On the other hand, if the identified clusters greatly overlap with each other, the overlapping rate is close to 1.

We used the BicAT software package [10] for ISA, CC, and OPSMs. Different gene and condition thresholds are used for the ISA. We carefully chose a set of thresholds with good performance and then slightly changed the thresholds to test the stability of the ISA. We used default settings for CC's model. The plaid algorithm was implemented using the ...

Figure 1. Simulated data with two biclusters and the results of the BBC analysis. Bayesian biclustering for simulated datasets. (a) A dataset with two non-overlapping clusters. (b)-(c) The two clusters found by the Bayesian biclustering model from (a). (d) A dataset with two clusters with common genes. (e)-(g) The three clusters found by the Bayesian biclustering model from (d). (h) A dataset with two clusters with both common samples and common genes. (i)-(k) The three clusters found by the Bayesian biclustering model from (h).
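The entry-level evaluation measures used in the comparison (sensitivity, specificity and overlapping rate) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `cells` and `evaluate`, the representation of a bicluster as a (genes, conditions) pair of index ranges, and the example clusters are all assumptions made for illustration. It also assumes at least one non-empty identified cluster and at least one background entry.

```python
from itertools import product


def cells(genes, conditions):
    """Set of (gene, condition) matrix entries covered by one bicluster."""
    return set(product(genes, conditions))


def evaluate(true_clusters, found_clusters, n_genes, n_conditions):
    """Entry-level sensitivity, specificity and overlapping rate.

    Each cluster is a (genes, conditions) pair of index collections
    (a hypothetical representation chosen for this sketch).
    """
    true_cells = set().union(*(cells(g, c) for g, c in true_clusters))
    found_sets = [cells(g, c) for g, c in found_clusters]
    found_cells = set().union(*found_sets)

    total = n_genes * n_conditions
    tp = len(true_cells & found_cells)   # true entries that were recovered
    fp = len(found_cells - true_cells)   # background entries pulled into clusters
    fn = len(true_cells - found_cells)   # true entries that were missed
    tn = total - tp - fp - fn            # background entries correctly excluded

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # 1 - |union of identified clusters| / sum over clusters of |cluster|
    overlap_rate = 1 - len(found_cells) / sum(len(s) for s in found_sets)
    return sensitivity, specificity, overlap_rate


# Hypothetical example on a 400 x 50 matrix: one true 100 x 20 cluster,
# and two identified clusters that partly overlap each other.
true = [(range(0, 100), range(0, 20))]
found = [(range(0, 100), range(0, 10)), (range(50, 150), range(5, 20))]
sens, spec, rate = evaluate(true, found, n_genes=400, n_conditions=50)
print(round(sens, 3), round(spec, 3), round(rate, 3))  # 0.75 0.958 0.1
```

In this made-up example the two identified clusters share 250 entries, so the union holds 2250 entries against a summed size of 2500, giving an overlapping rate of 0.1. With fully disjoint clusters the rate would be exactly 0, matching the definition in the text.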
Similar resources
Gene co-expression networks via biclustering Differential gene co-expression networks via Bayesian biclustering models
Identifying latent structure in large data matrices is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are locally co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for biclustering to infer subsets of co-re...
Differential gene co-expression networks via Bayesian biclustering models
Identifying latent structure in large data matrices is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are locally co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for biclustering to infer subsets of co-re...
Context Specific and Differential Gene Co-expression Networks via Bayesian Biclustering
Identifying latent structure in high-dimensional genomic data is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for biclustering to infer subsets of co-...
Sparse group factor analysis for biclustering of multiple data sources
MOTIVATION Modelling methods that find structure in data are necessary with the current large volumes of genomic data, and there have been various efforts to find subsets of genes exhibiting consistent patterns over subsets of treatments. These biclustering techniques have focused on one data source, often gene expression data. We present a Bayesian approach for joint biclustering of multiple d...
FABIA: factor analysis for bicluster acquisition
MOTIVATION Biclustering of transcriptomic data groups genes and samples simultaneously. It is emerging as a standard tool for extracting knowledge from gene expression measurements. We propose a novel generative approach for biclustering called 'FABIA: Factor Analysis for Bicluster Acquisition'. FABIA is based on a multiplicative model, which accounts for linear dependencies between gene expres...
Journal title:
Volume, issue
Pages -
Publication date 2009